@rameshraghupathy rameshraghupathy commented May 14, 2025

Provide support for SmartSwitch DPU module graceful shutdown.

Description:

  • Single source of truth for transitions

    • All components now use sonic_platform_base.module_base.ModuleBase helpers:

      • set_module_state_transition(db, name, transition_type)
      • clear_module_state_transition(db, name)
      • get_module_state_transition(db, name) -> dict
      • is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
    • Eliminates duplicated logic and race-prone direct Redis writes.

  • Correct table everywhere

    • Standardized on CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).
    • HLD mismatch addressed in code (HLD fix tracked separately).
  • Ownership & lifecycle

    • The initiator of an operation (startup/shutdown/reboot) sets:

      • state_transition_in_progress=True
      • transition_type=<op>
      • transition_start_time=<utc-iso8601>
    • The platform (set_admin_state()) is responsible for clearing:

      • state_transition_in_progress=False
      • optionally transition_end_time=<epoch> (or similar end stamp).
    • CLI pre-clears only when a prior transition is timed out.

  • Timeouts & policy

    • Timeouts are read from the platform JSON only (/usr/share/sonic/device/{plat}/platform.json); otherwise built-in constants apply.

    • Typical production values used:

      • startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
    • Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and implemented inside platform set_admin_state()—not in ModuleBase.

  • Boot behavior

    • chassisd on start:

      1. Clears stale flags once (centralized sweep).
      2. Runs set_initial_dpu_admin_state() which marks transitions via ModuleBase before calling platform set_admin_state().
      3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
  • gNOI shutdown daemon

    • Listens on CHASSIS_MODULE_TABLE and triggers only when:

      • state_transition_in_progress=True and transition_type=shutdown.
    • Never clears the flag (ownership stays with the platform).

    • Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).

  • CLI (config chassis modules …)

    • Uses ModuleBase APIs for all set/get/timeout checks.
    • If a previous transition is stuck, is_module_state_transition_timed_out() → auto-clear then proceed.
    • Sets transition at the start of startup/shutdown; platform clears on completion.
    • Fabric card flow retained; edits are surgical.
  • Redis robustness

    • Helpers handle both stacks (swsssdk/swsscommon); no hset(mapping=...) usage.
    • Consistent HGETALL/HSET paths; resilient to connector differences.
  • Race reduction & consistency

    • Centralized writes prevent multi-writer races.
    • All transition writes include transition_start_time; clears may add an end stamp.
    • Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
  • Change scope

    • Minimal, targeted diffs.
    • No background tasks added, no broad refactors beyond transition handling.
    • Behavior changes are limited to making transition semantics correct and uniform across repos.
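
The ownership rules above (initiator sets, platform clears, CLI pre-clears only on timeout) can be sketched roughly as follows. This is a minimal illustration, not the actual ModuleBase code: FakeStateDb is a hypothetical in-memory stand-in for Redis, the "CHASSIS_MODULE_TABLE|<name>" key layout is an assumption, and the extra `now` parameter is added here only to make the example deterministic.

```python
from datetime import datetime, timedelta, timezone


class FakeStateDb:
    """Hypothetical in-memory stand-in for the Redis-backed state DB."""

    def __init__(self):
        self._hashes = {}

    def hgetall(self, key):
        return dict(self._hashes.get(key, {}))

    def hset(self, key, field, value):
        self._hashes.setdefault(key, {})[field] = value


def _key(name):
    # Assumed key layout: table name and module name joined by "|".
    return f"CHASSIS_MODULE_TABLE|{name}"


def set_module_state_transition(db, name, transition_type, now=None):
    """Initiator side: mark a transition as started."""
    now = now or datetime.now(timezone.utc)
    key = _key(name)
    # One field per HSET call, in line with the "no hset(mapping=...)" note.
    db.hset(key, "state_transition_in_progress", "True")
    db.hset(key, "transition_type", transition_type)
    db.hset(key, "transition_start_time", now.isoformat())


def clear_module_state_transition(db, name):
    """Platform side: mark the transition as finished."""
    db.hset(_key(name), "state_transition_in_progress", "False")


def get_module_state_transition(db, name):
    return db.hgetall(_key(name))


def is_module_state_transition_timed_out(db, name, timeout_secs, now=None):
    entry = get_module_state_transition(db, name)
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        # Assumption: an in-progress flag without a start stamp is stale.
        return True
    now = now or datetime.now(timezone.utc)
    elapsed = (now - datetime.fromisoformat(start)).total_seconds()
    return elapsed > timeout_secs


if __name__ == "__main__":
    db = FakeStateDb()
    started = datetime.now(timezone.utc) - timedelta(seconds=200)
    set_module_state_transition(db, "DPU0", "shutdown", now=started)
    # CLI pre-clear: only when the earlier transition has timed out.
    if is_module_state_transition_timed_out(db, "DPU0", 180):
        clear_module_state_transition(db, "DPU0")
    print(get_module_state_transition(db, "DPU0"))
```

Passing the clock in as a parameter keeps the timeout check testable without sleeping; the real helpers take only the arguments listed in the description.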

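The "platform JSON, else constants" timeout policy from the description could look something like the sketch below. The key name `dpu_transition_timeouts` is purely hypothetical (the description does not say how platform.json stores these values), and the fallback constants are assumed to equal the typical production values quoted above.

```python
import json

# Assumed fallbacks: the "typical production values" from the description.
DEFAULT_TIMEOUTS = {"startup": 180, "shutdown": 180, "reboot": 120}

# Hypothetical key name, used here for illustration only.
TIMEOUTS_KEY = "dpu_transition_timeouts"


def load_transition_timeouts(platform_json_path):
    """Return per-operation timeouts in seconds, falling back to constants."""
    timeouts = dict(DEFAULT_TIMEOUTS)
    try:
        with open(platform_json_path) as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: use the constants.
        return timeouts

    overrides = data.get(TIMEOUTS_KEY) if isinstance(data, dict) else None
    if not isinstance(overrides, dict):
        return timeouts

    for op, value in overrides.items():
        valid_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        # Ignore unknown operations and non-positive or non-numeric values.
        if op in timeouts and valid_number and value > 0:
            timeouts[op] = int(value)
    return timeouts
```

Unknown keys and invalid values are ignored rather than raised, so a partially wrong platform.json still yields a usable set of timeouts.
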
HLD: sonic-net/SONiC#1991
sonic-platform-common: sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667

How to verify it
Issue the "config chassis modules shutdown DPUx" command.
Verify that the DPU module shuts down gracefully by checking the logs in /var/log/syslog on both the NPU and the DPU.

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@hdwhdw
Contributor

hdwhdw commented May 19, 2025

Do you mind pasting the test steps (commands) and their output in the PR description?

Copilot finished reviewing on behalf of vvolam November 13, 2025 18:20

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 15 comments.


@vvolam vvolam requested a review from qiluo-msft November 14, 2025 18:06
@vvolam
Contributor

vvolam commented Nov 14, 2025

@qiluo-msft could you please review this new service addition to sonic-host-services?

capture_output=True,
text=True,
timeout=5
)

Contributor

Maybe use native Python for this purpose?

  from swsscommon import swsscommon

  config_db = swsscommon.ConfigDBConnector()
  config_db.connect()
  entry = config_db.get_entry('DEVICE_METADATA', 'localhost')
  subtype = entry.get('subtype') if entry else None

If you need the exact same behavior as the subprocess (string output):

  from swsscommon import swsscommon

  config_db = swsscommon.ConfigDBConnector()
  config_db.connect()
  entry = config_db.get_entry('DEVICE_METADATA', 'localhost')
  result = entry.get('subtype', '') if entry else ''

With error handling like the original subprocess:

  from swsscommon import swsscommon

  try:
      config_db = swsscommon.ConfigDBConnector()
      config_db.connect()
      entry = config_db.get_entry('DEVICE_METADATA', 'localhost')
      subtype = entry.get('subtype') if entry else None
  except Exception:
      subtype = None

Contributor Author

@hdwhdw I have simplified it even further; please take a look.

# gNOI helpers
# ############

def execute_gnoi_command(command_args, timeout_sec=REBOOT_RPC_TIMEOUT_SEC):

Contributor

@rameshraghupathy in that case maybe at least rename the function to execute_command instead of execute_gnoi_command?
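
A generic, bounded-timeout wrapper along the lines of the suggested rename might look like the sketch below. It is not the code from this PR: the value of REBOOT_RPC_TIMEOUT_SEC and the negative return codes for timeout and launch failure are assumptions made for illustration.

```python
import subprocess
import sys

# Assumed value; the excerpt does not show what the constant is set to.
REBOOT_RPC_TIMEOUT_SEC = 60


def execute_command(command_args, timeout_sec=REBOOT_RPC_TIMEOUT_SEC):
    """Run a command with a bounded timeout and return (rc, stdout, stderr)."""
    try:
        result = subprocess.run(
            command_args,
            capture_output=True,
            text=True,
            timeout=timeout_sec,
        )
    except subprocess.TimeoutExpired:
        # Illustrative convention: -1 means the command timed out.
        return -1, "", f"command timed out after {timeout_sec}s"
    except OSError as exc:
        # Illustrative convention: -2 means the command could not be started.
        return -2, "", f"command failed to start: {exc}"
    return result.returncode, result.stdout.strip(), result.stderr.strip()


if __name__ == "__main__":
    rc, out, err = execute_command([sys.executable, "-c", "print('ok')"], timeout_sec=10)
    print(rc, out, err)
```

Returning a tuple instead of raising keeps the caller's retry and logging logic in one place, which suits a daemon that must never crash on a failed RPC.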

'scripts/determine-reboot-cause',
'scripts/process-reboot-cause',
'scripts/check_platform.py',
'scripts/wait-for-sonic-core.sh',

Contributor

@rameshraghupathy why are we not copying gnoi_shutdown_daemon.py? Is the PR tested end to end? If yes, could you share the gnmi.logs with reboot call log samples?

Contributor Author

@rameshraghupathy rameshraghupathy Nov 15, 2025

@vvolam Looks like it was missed, thanks! I tested it locally; please find the results below.

gNOI halt request failing case: using DPU1

2025 Nov 15 09:04:11.933165 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU1: Admin shutdown detected, initiating gNOI HALT
2025 Nov 15 09:04:11.933340 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU1: Admin shutdown detected, initiating gNOI HALT
2025 Nov 15 09:04:11.933439 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU1: Starting gNOI shutdown sequence
2025 Nov 15 09:04:11.933557 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU1: Starting gNOI shutdown sequence
2025 Nov 15 09:04:12.352564 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU1: PCI detach complete, proceeding for halting services via gNOI
2025 Nov 15 09:04:12.352701 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU1: PCI detach complete, proceeding for halting services via gNOI
2025 Nov 15 09:05:29.396271 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU1: gNOI sequence failed

gNOI halt request normal passing case: using DPU2

2025 Nov 15 10:02:52.643878 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: Admin shutdown detected, initiating gNOI HALT
2025 Nov 15 10:02:52.644048 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: Starting gNOI shutdown sequence
2025 Nov 15 10:02:52.644156 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: Admin shutdown detected, initiating gNOI HALT
2025 Nov 15 10:02:52.644212 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: Starting gNOI shutdown sequence
2025 Nov 15 10:02:53.030039 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: PCI detach complete, proceeding for halting services via gNOI
2025 Nov 15 10:02:53.030204 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: PCI detach complete, proceeding for halting services via gNOI
2025 Nov 15 10:03:47.303558 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: boot halt success rc: 0 out_s:System RebootStatus#012{"reason":"Halt reboot completed","count":1,"method":3,"status":{"status":1}}
2025 Nov 15 10:03:47.303710 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: boot halt success rc: 0 out_s:System RebootStatus
2025 Nov 15 10:03:47.304954 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: gNOI sequence completed
2025 Nov 15 10:03:47.305021 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: gNOI sequence completed
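
As a rough aid for reading the RebootStatus line in the passing case, the JSON payload can be pulled out of the logged output as sketched below. The "#012" sequence is how syslog renders the embedded newline. Treating a nested status value of 1 as success is an assumption based on the accompanying "boot halt success" message, and both function names are hypothetical.

```python
import json


def parse_reboot_status(output):
    """Return the JSON object embedded in RebootStatus output, or None."""
    # Undo the syslog escaping of newlines in text copied from the log.
    text = output.replace("#012", "\n")
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode tolerates trailing text after the JSON object.
        payload, _ = json.JSONDecoder().raw_decode(text[start:])
    except ValueError:
        return None
    return payload if isinstance(payload, dict) else None


def halt_completed(output, success_status=1):
    """Assumption: a nested status equal to success_status means success."""
    payload = parse_reboot_status(output)
    if payload is None:
        return False
    status = payload.get("status")
    return isinstance(status, dict) and status.get("status") == success_status


if __name__ == "__main__":
    sample = ('System RebootStatus#012{"reason":"Halt reboot completed",'
              '"count":1,"method":3,"status":{"status":1}}')
    print(parse_reboot_status(sample))
    print(halt_completed(sample))
```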

Name    Description       Physical-Slot    Oper-Status    Admin-Status    Serial
------  ----------------  ---------------  -------------  --------------  -------------
DPU0    N/A               N/A              Offline        down            N/A
DPU1    AMD Pensando DSC  N/A              Offline        down            FLM281704EZ-1
DPU2    AMD Pensando DSC  N/A              Offline        down            FLM281704EK-0
DPU3    N/A               N/A              Offline        down            N/A
DPU4    AMD Pensando DSC  N/A              Online         up              FLM281704EM-0
DPU5    AMD Pensando DSC  N/A              Online         up              FLM281704EM-1
DPU6    AMD Pensando DSC  N/A              Online         up              FLM281704EU-0
DPU7    N/A               N/A              Offline        down            N/A

Note:

Make sure that:

  1. gnoi_shutdown_daemon.py is present in /usr/local/bin.
  2. The daemon is running.
  3. For MtFuji, the platform module.py has the following until this feature is committed:
    def module_pre_shutdown(self):
        return True

@hdwhdw hdwhdw self-requested a review November 14, 2025 23:25